DNABarcodes package from Tilo BuschmannDNABarcodeCompatibility works in 2 steps:
Step1: finding compatible combinations of barcodes
To find compatible combinations of DNA barcodes, one approach is to generate all possible combinations of DNA barcodes and perform an exhaustive search of compatible combinations. However, the total number of combinations may become prohibitively large for large barcode sets and large multiplexing numbers. Thus, an exhaustive search of compatible combinations among all possible combinations may not be possible and in fact is not usually useful for such large datasets. An alternative strategy is to choose a fixed number of randomly selected combinations for finding compatible combinations. To determine the conditions for which a random search is preferable over an exhaustive search, we consider the execution time of the program as a critical parameter.
The execution time of the program depends on the size of the barcode dataset, the multiplex level and the Illumina chemistry used. We therefore test different values of these parameters to vizualise how they influence the execution time.
Table 2.2: parameters used in the simulations
| Number of simulations per condition | Barcode-set size | Multiplex level | Chemistry |
|---|---|---|---|
| 10 | 12, 18, 24, 30, 36, 42, 48 | 2, 3, 4, 5 | 1, 2, 4 |
| Number of simulations | |
|---|---|
| 840 |
The execution time of an exhaustive search increases linearly with the total number \({n \choose k}\) of combinations. The constant of proportionality corresponds to the average computation time required to check compatibility of a given combination. This time is slighty faster for the 4-channel chemistry. For the ease of use, we fix to a few seconds the time of execution. On a standard computer, this corresponds to an exhaustive search among 2000 combinations for the 4-channel chemistry. Above 2000 combinations, the program will perform a search of compatible barcodes from a randomly generated set of barcode combinations. Other important aspects also motivate this choice (see below).
The total number of combinations (compatible or not) is given by the binomial coefficient \({n \choose k}\) where \(n\) is the number of barcodes in the dataset, referred to as the barcode-set size, and \(k\) is the mutliplex level (\(n\) and \(k\) being positive integers). We now plot the execution time as function of the barcode-set size for various multiplex levels. This reveals that the execution time is less than 2 seconds for a multiplex level of 2 with the 48-barcode Illumina small RNA TruSep kits. Therefore, in these conditions, it is advantageous to perform an exhaustive search of compatible combinations for a multiplex level of 2. For a multiplex level of 4 or higher, the program will typically perform a search of compatible barcodes from a randomly generated set of barcode combinations.
Note that the probability that a randomly chosen combination of \(k\) barcodes among \(n\), which are themselves chosen randomly among the 48 barcodes of the Illumina set, is independent of \(n\), consistent with the observations. This probability is moreover fairly independent on the chemistry for this particular barcode set.
For a multiplex level of 2, the number of compatible combinations is less than 100 for the combined Illumina TruSeq small RNA kits. Therefore, for a multiplex level of 2, an exhaustive search is preferable to retrieve all possible compatible combinations of barcodes for the optimization steps. We fix to 2024 the total number of combinations beyond which a (much faster) random search of compatible combinations is performed (see dashed line in the graph below). This number is motivated by two reasons. Below this threshold, an exhaustive search will be systematically performed for a multiplex level of 2 (less than 2024 combinations in total). Above this threshold, a random search will provide enough compatible combinations for futher optimization steps. Indeed, the above results indicate that the probability to get a compatible combination is at least 18% for multiplex levels >= 3. Briefly, the random search will randomly pick up an arbitrarily fixed number of 1000 combinations representing a pool of at least 180 compatible combinations, which is enough for the further optimization steps to be efficient (see below). Thus, the DNABarcodeCompatibility::experiment_design() function of the package will use a threshold of 2024 combinations in total to switch between an exhaustive search and a random search of compatible combinations.
The frequencies of the various barcodes appearing in the compatible combinations generated above may be highly heterogeneous for small multiplex levels. We plot here the distribution of barcodes averaged over the different barcode sets, for distinct chemistries and multiplex levels. Despite the averaging of the distibutions over different barcode sets of different sizes, the heterogeneity between barcode frequencies is clearly apparent.
Individual sets of 12 barcodes were randomly generated from the four Illumina TruSeq small RNA kits comprising 48 barcodes in total. We see that barcode frequencies can be very heterogeneous even for multiplex levels of 4 or 5. Therefore, without further optimization, consumable usage may be strongly biased among preparation kits.
The goal here is to estimate the probability to get an optimal solution, in the sense of a maximum entropy solution, and to test how the optimizer improves the solution compared with a random pick.
We compare the greedy search algorithms implemented in the DNABarcodeCompatibility R-package: the greedy descent and the greedy exchange algorithms (described in supplementary file S2). We use different subsets of randomly-generated compatible combinations, and for each of those we perform 100 simulations with both algorithms, in order to estimate the probability \(P_{1}\) to get an optimal solution. This optimal solution consists of a selection of compatible barcode combinations in which the barcode distribution is close to uniform. The number of barcode appearing in the optimal solution is the product of the multiplex level by the number of lanes, which defines the number of libraries to be sequenced.
Based on this probability \(P_{1}\), we devise an optimizer function that performs \(R\) trials of the greedy search. The probability to get an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).
Step2a: find the probability \(P_{1}\) to get an optimal combination given the subset size, and estimate \(R\) such that \(P_{R} \geq 0.95\), in order to refine the optimization in a reasonable computation time.
Step2b: check the efficacy of the optimizer function on a few challenging examples
Table 2.3.1: parameters used for the simulations of step2a
| Number of simulations per condition | Barcode-set size | Multiplex level | Chemistry | Lane number | Algorithm | Compatible combinations (GD) | Compatible combinations (GE) |
|---|---|---|---|---|---|---|---|
| 100 | 12, 18, 24, 30 | 2, 3, 4, 5 | 2, 4 | 4, 6, 8 | greedy_exchange, greedy_descent | 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 | 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 400 |
| Number of simulations | |
|---|---|
| 3.16810^{5} |
We perform 100 repetitions of the greedy search for the conditions (described in table 2.3.1 above). When the set of compatible combinations is large, the computation time, required to run these greedy algorithms on the full set of compatible combinations generated in step1, may become prohibitively large. To reduce the computation time, we apply the greedy search on a randomly generated subset of 10 \(\leq N_{comp}\leq\) 400 compatible combinations, for a barcode set size comprized between 12 and 30, and a multiplex level between 2 and 5. Here, we want to know how the size of the subset influences the computation time for each of the greedy search algorithm.
We averaged the execution time over 100 replicates of different combinations of conditions of chemistry, barcode-set size, number of lanes, and multiplex level. Part of the variability observed for the execution time comes from the diversity of conditions that may substantially influence the execution time.
The graph below indicates that the running time of the greedy-descent algorithm is quadratic as a function of the barcode-set size, which is expected, since the number of operations required to complete this algorithm on a set of \(N_{comp}\) compatible combinations is \(\mathcal{O}(N_{comp}^2)\). In contrast, the computation time of the greedy-exchange algorithm is linear (\(\mathcal{O}(N_{comp})\)).
We perform 100 repetitions of the greedy search for the conditions (see table 2.3.1 above) to estimate the probability to reach the theoretical maximum entropy \(S_{max}\) in each condition. Here, we test how the size of the subset influences the probability to reach \(S_{max}\).
To our surprise, results are very heterogeneous among conditions. However, one common trend is that larger subsets of compatible combinations (> 50 compatible combinations) give better results. In some cases, \(S_{max}\) is not reached, which means that either the probability \(P_{1}\) is below 1% or there is not any set of compatible combination that can found. Note that both cases are possible. However, in all cases, the greedy algorithms allow us to substantially reduce, if not completely remove the heterogeneity in the frequencies of barcode usage (see below).
In some cases, the probability to reach the theoretical maximum entropy \(S_{max}\) might be \(\leq 10^{-2}\) or even be zero. Here, to take account of such cases, we estimate the probability to get the highest value of the entropy \(S_{opt}\) where possibly \(S_{opt} < S_{max}\). In these conditions, the barcode distribution won’t necessarily be uniform, but still much less heterogeneous than it would be for a selection obtained by a random pick (see below).
The probability of getting an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).
The results shown in the graphs below are heterogeneous between conditions. However, common trends emerge. First, in general, the greedy-exchange search performs the optimization faster over R iterations than the greedy-descent search, except in a few cases (see next point) where the number \(N_{comp}\) of compatible combinations is smaller than 30. Second, when \(N_{comp}\) is larger than 120 (only tested for the greedy-exchange algorithm), the optimization requires less iterations in general and performs faster. Based on these observations, it seems reasonable to fix the compatible-combination subset size to 120.
In the graphs below, data are now pooled regardless chemistry and barcode-set sizes. The number displayed in the right grey panel represents the size the randomly chosen subset of compatible combinations. We confirm that the greedy-descent search performs faster for small subsets of compatible combinations. For subsets larger than 30 compatible combinations, it seems that 50 iterations would be a reasonable number for an efficient and fast optimization by the greedy-exchange search.
To visualize the efficacy of the greedy search algorithms, we compare the distribution of entropy values obtained from randomly generated selections of compatible barcodes with the distribution of entropy values obtained from optimized selections of compatible barcodes. The graphs below show that both greedy algorithms are comparable and dramatically reduce the barcode frequencies’ heterogeneity by favoring selections of compatible barcode combinations with a higher entropy. One graph table is shown per subset size. The blue line indicates the position of the theoretical value of the maximum entropy \(S_{max}\). A dashed line indicates that \(S_{max}\) has been reached, whereas a solid line indicates that \(S_{max}\) has not been reached.
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
The results of the simulations in step2a suggest a few cases for which the optimization procedure would require more than 100 iterations to reach \(S_{max}\). It is instructive to use these challenging conditions to test how good our greedy-exchange algorithm would perform compared to a basic optimization by iterative random picks. In the table below are summarized a few challenging conditions to be tested.
| Number of simulations per condition | Barcode-set size | Multiplex level | Chemistry | Lane number | Algorithm | Compatible combinations (GD) | Compatible combinations (GE) | Max iterations R |
|---|---|---|---|---|---|---|---|---|
| 100 | 12, 30 | 4, 5 | 4 | 6 | greedy_exchange | 120 | 200 |
| Number of simulations | |
|---|---|
| 400 |
We compare the distributions of entropy values obtained using 3 different strategies over a maximum of 200 iterations per trial. We performed 100 trials for each strategy. We calculate and compare the 3 resulting entropy values:
In the graphs below, the blue line indicates the position of the theoretical value of the maximum attainable entropy \(S_{max}\). Solid lines indicate that \(S_{max}\) has not been reached over 200 iterations. Either the probability to find \(S_{max}\) is too low to be reached in 200 iterations, or \(S_{max}\) cannot be reached at all due to constraints in the barcode set. Furthermore, the dashed line indicates that \(S_{max}\) has been reached.
In all cases, the greedy search algorithm performs better than a random search, and reaches a value close to the best entropy value.
## [[1]]
| Conditions | Set of 12 barcodes | Set of 30 barcodes |
|---|---|---|
| mplex-4 | 245 sec | 253 sec |
| mplex-5 | 1 sec | 291 sec |
As above \(n\), \(k\) and \(a\) denote the barcode set size, the multiplex level, and the number of lanes, respectively
| Barcode set | |
|---|---|
| RPI03, RPI04, RPI06, RPI07, RPI08, RPI09, RPI11, RPI12, RPI14, RPI15, RPI16, RPI17, RPI18, RPI21, RPI22, RPI23, RPI27, RPI31, RPI32, RPI33, RPI34, RPI35, RPI36, RPI37, RPI38, RPI40, RPI43, RPI46, RPI47, RPI48 |
We use the same experimental conditions as above for a fair comparison. Here, the number of libraries \(N=a.k=24\) is smaller than the barcode set size (\(n=30\)). The optimal selection has a uniform distribution of the barcodes used (of entropy \(S_{max} = log(N)\)), and is reached by the greedy algorithms.
| Barcode set | |
|---|---|
| RPI03, RPI04, RPI06, RPI07, RPI08, RPI09, RPI11, RPI12, RPI14, RPI15, RPI16, RPI17, RPI18, RPI21, RPI22, RPI23, RPI27, RPI31, RPI32, RPI33, RPI34, RPI35, RPI36, RPI37, RPI38, RPI40, RPI43, RPI46, RPI47, RPI48 |
We use the same experimental conditions as above for a fair comparison. Here, the number of libraries \(N = a.k = 30\) is equal to the barcode set size \(n\), and the maximum possible value of the entropy of the barcode selection is the entropy of the uniform distribution, \(S_{max}=log(n)\). Such a selection, exhausting the set of available barcodes, would correspond to a partition of this set in \(a=6\) disjoint combinations of \(k=5\) barcodes. However, no such partition exists for the set of compatible combinations at hand, which allows one to form selections containing a maximum of 29 barcodes. Hence, a uniform barcode distribution cannot be attained. Nevertheless, the greedy algorithm still finds a selection having the maximum attainable entropy, \(S_{opt} = 3.35\), corresponding to the value of \(S_{max}\) obtained (Eq. 1 in Supplementary File S2) for \(N = 30\) and \(n = 29\).
| Barcode set | |
|---|---|
| RPI01, RPI06, RPI07, RPI10, RPI15, RPI16, RPI28, RPI31, RPI41, RPI42, RPI46, RPI47 |
We use the same experimental conditions as above for a fair comparison. Here, the number of libraries (\(N=a.k=30\)) is greater than the barcode set size (\(n=12\)). A selection with uniform barcode distribution cannot be reached, however, because \(N\) is not divisible by \(n\). However, the selection having the maximum possible entropy \(S_{max}\) is reached.
| Barcode set | |
|---|---|
| RPI01, RPI06, RPI07, RPI10, RPI15, RPI16, RPI28, RPI31, RPI41, RPI42, RPI46, RPI47 |
We use the same experimental conditions as above for a fair comparison.
DNABarcodes package from Tilo BuschmannWe use the DNABarcodes package from Tilo Buschmann to generate a new set of 6-mer barcodes, which are robust against at least one substitution error (Hamming distance of 3). In addition, barcodes including stretches of 4 or more nucleotides (homopolymer of length >3) are discarded as well as those that have an unbalanced ratio of bases G or C versus A or T. We then exclude barcodes that are present both in the Illumina TruSeq small RNA kits and in the newly generated set.
library("DNABarcodes")
## Loading required package: Matrix
## Loading required package: parallel
newBarcodesRaw <- create.dnabarcodes(n=6, dist=3, metric = "hamming", filter.triplets = F, filter.gc = T, filter.self_complementary = F)
## 1) Creating pool ... of size 1280
## 2) Conway closing... done
newBarcodes <- newBarcodesRaw[!(newBarcodesRaw %in% DNABarcodeCompatibility::IlluminaIndexes$sequence)]
newBarcodeSet <- data.frame(Id=paste0("B",sprintf("%02d",1:length(newBarcodes))), sequence=newBarcodes, GC_content=50, homopolymer=FALSE, stringsAsFactors = F)[1:48,]
newBarcodeSet[,1:2]
## Id sequence
## 1 B01 GGGAAA
## 2 B02 CCCAAA
## 3 B03 CGAGAA
## 4 B04 ACGGAA
## 5 B05 GACGAA
## 6 B06 GCACAA
## 7 B07 CAGCAA
## 8 B08 AGCCAA
## 9 B09 TCGAGA
## 10 B10 GTCAGA
## 11 B11 CGTAGA
## 12 B12 TGACGA
## 13 B13 ATGCGA
## 14 B14 GATCGA
## 15 B15 CCATGA
## 16 B16 CTGACA
## 17 B17 TGCACA
## 18 B18 GCTACA
## 19 B19 TCAGCA
## 20 B20 ATCGCA
## 21 B21 CATGCA
## 22 B22 GGATCA
## 23 B23 TGGGTA
## 24 B24 TCCCTA
## 25 B25 GCGTTA
## 26 B26 CGCTTA
## 27 B27 TAGGAG
## 28 B28 AGTGAG
## 29 B29 TTCCAG
## 30 B30 CTGTAG
## 31 B31 ACCTAG
## 32 B32 GAAAGG
## 33 B33 ACTAGG
## 34 B34 TTTGGG
## 35 B35 AGATGG
## 36 B36 TACTGG
## 37 B37 AAGACG
## 38 B38 TAACCG
## 39 B39 ATTCCG
## 40 B40 GATTCG
## 41 B41 CGAATG
## 42 B42 GTGATG
## 43 B43 AACCTG
## 44 B44 TGTCTG
## 45 B45 CCTTTG
## 46 B46 TCTGAC
## 47 B47 CTACAC
## 48 B48 GAGTAC
| Number of simulations per condition | Barcode-set size | Multiplex level | Chemistry |
|---|---|---|---|
| 10 | 12, 18, 24, 30, 36, 42, 48 | 2, 3, 4, 5 | 1, 2, 4 |
| Number of simulations | |
|---|---|
| 840 |
For a multiplex level of 2, the number of compatible combinations is even smaller than for the combined Illumina TruSeq small RNA kits. Therefore, for a multiplex level of 2, an exhaustive search is preferable to retrieve all possible compatible combinations of barcodes for the optimization steps. The dashed line in the graph below indicates the threshold (2024) over which the DNABarcodeCompatibility::experiment_design() function of the package will perform a random search of compatible combinations instead of an exhaustive search.
The frequencies of the various barcodes appearing in the compatible combinations generated above may be highly heterogeneous for small multiplex levels. We plot here the distribution of barcodes averaged over the different barcode sets, for distinct chemistries and multiplex levels. Despite the averaging of the distibutions over different barcode sets of different sizes, the heterogeneity between barcode frequencies is clearly apparent.
| Number of simulations per condition | Barcode-set size | Multiplex level | Chemistry | Lane number | Algorithm | Compatible combinations (GD) | Compatible combinations (GE) |
|---|---|---|---|---|---|---|---|
| 100 | 12, 18, 24, 30 | 3, 4, 5 | 2, 4 | 4, 6, 8 | greedy_exchange | 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300 |
| Number of simulations | |
|---|---|
| 1.0810^{5} |
In some cases, the probability to reach the theoretical maximum entropy \(S_{max}\) might be \(\leq 10^{-2}\) or even be zero. Here, to take account of such cases, we estimate the probability to get the highest value of the entropy \(S_{opt}\) where possibly \(S_{opt} < S_{max}\). In these conditions, the barcode distribution won’t necessarily be uniform, but still much less heterogeneous than it would be for a selection obtained by a random pick (see below).
The probability of getting an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).
The results shown in the graphs below are heterogeneous between conditions. However, common trends emerge. First, in general, the greedy-exchange search performs the optimization faster over R iterations than the greedy-descent search, except in a few cases (see next point) where the number \(N_{comp}\) of compatible combinations is smaller than 30. Second, when \(N_{comp}\) is larger than 120 (only tested for the greedy-exchange algorithm), the optimization requires less iterations in general and performs faster. Based on these observations, it seems reasonable to fix the compatible-combination subset size to 120.
In the graphs below, data are now pooled regardless chemistry and barcode-set sizes. The number displayed in the right grey panel represents the size the randomly chosen subset of compatible combinations. We confirm that the greedy-descent search performs faster for small subsets of compatible combinations. For subsets larger than 30 compatible combinations, it seems that 50 iterations would be a reasonable number for an efficient and fast optimization by the greedy-exchange search.
To visualize the efficacy of the greedy search algorithms, we compare the distribution of entropy values obtained from randomly generated selections of compatible barcodes with the distribution of entropy values obtained from optimized selections of compatible barcodes. The graphs below show that both greedy algorithms are comparable and dramatically reduce the barcode frequencies’ heterogeneity by favoring selections of compatible barcode combinations with a higher entropy. One graph table is shown per subset size. The blue line indicates the position of the theoretical value of the maximum entropy \(S_{max}\). A dashed line indicates that \(S_{max}\) has been reached, whereas a solid line indicates that \(S_{max}\) has not been reached.
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]